CAGEF_services_slide.png

Lecture 05: Of Data Cleaning and Documentation - Conquer Regular Expressions and Challenge yourself with a 'Real' Dataset

0.1.0 A quick intro to the Introduction to R for Data Science

This 'Introduction to R for Data Science' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1020 was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

This lesson is the fifth in a 7-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.

So far we have discussed the tidyverse and its tools. You've learned a lot of the "verbs" needed to slice, format, and tidy your data. You can now convert from wide to long-format data and we've taken a side-journey into visualizing your data. Now we'll revisit dataset manipulation through text manipulation and regular expressions.

data-science-explore.png

The structure of the class is code-along: it is fully hands-on. Prior to each lecture, the materials will be emailed to you and will also be available for download on Quercus, so you can spend more time coding than taking notes.

0.1.1 Class Objectives

At the end of this session you will be able to use tidyverse tools and regular expressions to tidy/clean your data.

  1. Introduction to data cleaning
  2. Regular expressions (RegEx)
  3. String manipulation tools via stringr
  4. A step-by-step example for converting a fasta file
  5. Resources

0.2.0 How do we get there?

Today we are going to be learning data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. In the next lesson we will learn how to do t-tests and perform regression and modeling in R.

spotify-howtobuildmvp.gif

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

0.4.0 Data used in this session

We have 2 data files:

  1. regex_word.docx
  2. FoxP2_primate.fasta

0.4.1 Datafile 1: regex_word.docx

This is an example file for us to start playing with the idea of regular expressions.

0.4.2 Datafile 2: FoxP2_primate.fasta

This is the main file that we'll be working with for the rest of the lecture. We'll search, replace, and manipulate data from this file after importing it into our notebooks.

0.5.0 Packages Used in This Lesson

The following packages are used in this lesson:

tidyverse (ggplot2, tidyr, dplyr, stringr)

These packages should already be installed into your Anaconda base from previous lectures and should be readily available in JupyterHub. If not, please review that lesson and load these packages. Please remember to install these packages from the conda-forge channel of Anaconda.


1.0.0 Data cleaning or "data munging" or "data wrangling"

Why do we need to do this?

'Raw' data is seldom (never) in a usable format. Data in tutorials or demos have already been meticulously filtered, transformed and readied to showcase that specific analysis. How many people have done a tutorial only to find they can't get their own data into the format needed to use the tool they just spent an hour learning about?

Data cleaning requires us to detect and fix (or remove) incorrect, incomplete, or inconsistently formatted records.

Some definitions might take this a bit farther and include normalizing data and removing outliers. In this course, we consider data cleaning as getting data into a format where we can start actively exploring our data with graphics, data normalization, etc.

Today we are going to mostly focus on the data cleaning of text. This step is crucial for taking control of your dataset and your metadata. I have included the functions I find most useful for these tasks but I encourage you to take a look at the Strings Chapter in R for Data Science for an exhaustive list of functions. We have learned how to transform data into a tidy format in lectures 2 and 3, but the prelude to transforming data is doing the grunt work of data cleaning. So let's get to it!

cleaning.gif


2.0.0 Introduction to regular expressions (RegEx)

"A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as 'write only', because regular expressions are easier to write than to read/understand. And they are not particularly easy to write." - Jenny Bryan

RegEx is a very sophisticated way to find, replace, and extract information from strings.

For our first regex exercise, use Microsoft Word to open the file "regex_word.docx". This file contains one string: "Bob and Bobby went to Runnymede Road for a run and then went apple bobbing.". Here is what we are going to do:

xkcd-1171-perl_problems.png

So why do regular expressions, or 'RegEx', get so much flak when they are so powerful for text matching? Scary example: how to match an email address in different programming languages: http://emailregex.com/.

Writing/reading RegEx is definitely one of those situations where you should annotate your code. There are many terrifying urban legends about people coming back to their own code and having no idea what it means.

yesterdays-regex.png

There are sites available to help you build your regular expressions and validate them against text. These are usually not R-specific, but they will get you close; the expression will only need a slight modification for R (like an extra backslash - described below).

RegEx testers:

https://regex101.com/
https://regexr.com/

Today we will be practicing RegEx at Simon Goring's R-specific demo tester:
https://simongoring.shinyapps.io/RegularExpressionR/

What does the language look like?

The language is based on meta-characters, which have a special meaning rather than their literal meaning. For example, '$' is used to match the end of a string, and this use supersedes its use as a literal character in a string (i.e. 'Joe paid $2.99 for chips.').


2.1.0 Classes

What kind of character is it?

Expression Meaning
\w, [A-Za-z0-9_] word characters (letters, digits, underscore)
[[:alnum:]], [A-Za-z0-9] alphanumeric characters (letters + digits)
\d, [0-9], [[:digit:]] digits
[A-Za-z], [[:alpha:]] alphabetical characters
\s, [[:space:]] whitespace
[[:punct:]] punctuation
[[:lower:]] lowercase letters
[[:upper:]] uppercase letters
\W, [^A-Za-z0-9_] not word characters
\S not whitespace
\D, [^0-9] not digits

Note that some of these are not universal: the POSIX character classes, i.e. [:xxx:], must be used within a second set of brackets (e.g. [[:digit:]]) to be valid. This syntax is compatible with many regex engines, including those used by R and Unix tools.


2.2.0 Quantifiers

How many times will a character appear?

Expression Meaning
? 0 or 1 occurrence
* 0 or more occurrences
+ 1 or more occurrences
{n} exactly n occurrences
{n,} at least n occurrences
{0,n} at most n occurrences
{n,m} between n and m occurrences (inclusive)

2.3.0 Operators

Helper actions to match your characters.

Expression Meaning
| or
. matches any single character
[ ... ] matches ANY of the characters inside the brackets
[ ... - ... ] matches a RANGE of characters inside the brackets
[ ^... ] matches any character EXCEPT those inside the bracket
( ... ) grouping - used for backreferencing (https://www.regular-expressions.info/backref.html)

2.4.0 Matching by position

Where is the character in the string?

Expression Meaning
^ start of the string
$ end of the string
\b empty string at either edge of a word
\B empty string that is NOT at the edge of a word

2.5.0 Escape characters

Sometimes a meta-character is just a character. The escape character (\) allows you to use a meta-character 'as is' rather than with its special RegEx function. In R, patterns are written as strings, and the string is parsed first before it is interpreted as a regular expression. One backslash is consumed at the string-parsing step, so we often need two backslashes: to match a literal $, for example, we escape it once for the RegEx and then escape that backslash for the string, ending up with a pattern that looks like "\\$".

Expression Meaning
\ escape for meta-characters to be used as literal characters (*, $, ^, ., ?, |, \, [, ], {, }, (, )). Note: the backslash is itself a meta-character.

2.6.0 Trouble-shooting RegEx

Trouble-shooting with escaping meta-characters means adding backslashes until something works.

backslashes.png

While you can always refer back to this lesson for making your regular expressions, you can also use this RegEx cheatsheet.

Google.regex.jpg


3.0.0 Online RegEx exercise

We will kick off this RegEx lecture by playing around with an online tool: https://regexr.com/.

3.1.0 Breaking down a regular expression example

In the text section of the online tool, replace the default text by:

>NP_001009020.1 forkhead box protein P2 [Pan troglodytes] MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQT SGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYK

We are going to explore the following expression by breaking it down into its components: ^>(\w+)\.(\d)\s(.+)\[(\w+)\s(\w+).*\]

What do you think it matches in our text?

3.1.1 The breakdown:

Expression Meaning
^ beginning of a line
> greater-than symbol, used to designate the beginning of a new entry in fasta format
\w match a word character: a letter, digit, or underscore (in R strings, a second backslash is required: \\w)
+ match the preceding element one or more times
\. match a literal period (the backslash escapes the period's special meaning)
\d match a digit
\s match a whitespace character
.+ anything, one or more times
\[ match a literal square bracket (escaped)
.* anything, zero or more times

3.2.0 Introduction to string manipulation with stringr

Common uses of string manipulation are: Searching, replacing or removing (making substitutions), and splitting and combining substrings.

Base R offers a variety of built-in RegEx functions such as grep(), grepl(), and gsub() that are common to other programming languages. Even though these functions are computationally very efficient, it can be very challenging to master their use. Tidyverse's stringr package offers a more comprehensive, user-friendly set of RegEx-compatible functions that are easier to work with for those unfamiliar with regular expressions. Thus, we will stick to stringr to get some hands-on RegEx experience.

As an example, we are going to play with a string of DNA.

This piece of DNA is from the book Jurassic Park; it was supposed to be dinosaur DNA, but is actually just a cloning vector. Bummer.




3.2.1 RStudio's find-replace

RStudio has its own RegEx-compatible find-and-replace functionality in its graphical user interface (GUI), which you can open by hitting Ctrl+F. As a quick example, let's find out where in this document we can find the string GCGTTGCTGGCG. You can also check other boxes if your search is case sensitive or if you are looking for whole words. A similar option exists for Jupyter notebooks under Edit > Find and Replace in the menu settings.


3.2.2 Remove unwanted text with str_remove() and str_remove_all()

Our string "dino" is in FASTA format, but we don't need the header; we just want to deal with the DNA sequence. The header begins with '>' and ends with a number, '1200', with a space between the header and the sequence. Let's practice capturing each of these parts of a string, and then we'll make a regular expression to remove the entire header.

All stringr functions take in as arguments the string you are manipulating and the pattern you are capturing. str_remove replaces the matched pattern with an empty character string "". In our first search we remove '>' from our string, dino.

Next we can search for numbers. The expression '[0-9]' matches any digit. Always make sure to check that the pattern you are using gives you the output you expect.

3.2.2.1 str_remove() replaces the first instance of a search string versus str_remove_all()

Why aren't all of the numbers replaced? str_remove only replaces the first match in a character string. Switching to str_remove_all replaces all instances of numbers in the character string.
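As a minimal sketch (the dino string here is a shortened, hypothetical stand-in for the full classroom sequence):

```r
library(stringr)

# Shortened stand-in for the full classroom 'dino' string (hypothetical)
dino <- ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200 GCGATGGCCAAATAAGGT"

str_remove(dino, "[0-9]")      # removes only the FIRST digit (the 1 in '103')
str_remove_all(dino, "[0-9]")  # removes every digit in the string
```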

How do we capture spaces? The pattern '\s' matches a space. However, for the backslash not to be consumed as the string escape character (its special function), we need to add another backslash, making our pattern '\\s'. In other words, you need to escape the backslash itself with another backslash.

To remove the entire header, we need to combine these patterns. The header is everything in between '>' and the number '1200' followed by a space. The operator . captures any single character and the quantifier * matches it any number of times (including zero).
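Putting the pieces together, one hedged sketch of removing the whole header (patterns written with R's doubled backslashes; dino is a shortened stand-in, redefined here so the snippet is self-contained):

```r
library(stringr)

dino <- ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200 GCGATGGCCAAATAAGGT"

# '>' then anything (greedy), ending with a digit followed by a space
str_remove(dino, ">.*[0-9]\\s")
# → "GCGATGGCCAAATAAGGT"
```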


3.2.3 Greedy vs. lazy matching

>DinoDNA from Crichton JURASSIC PARK p. 103 nt 1-1200 GCGTTGCTGGCGTTTTTCCATAGGCTCCG...

You may have noticed that we also have a number followed by a space earlier in the header, '103 '. Why didn't the replacement end at that first match? This is an example of greedy matching - the expression captures the longest possible matching string.

To curtail this behavior and use lazy matching - the shortest possible string - you can add the ? quantifier. Remember that this character can also signify looking for 0 or 1 occurrences of a pattern.

In this case, we are going to use it to make the preceding quantifier * lazy by causing it to match as few characters as possible.
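A sketch of the lazy version, using the same shortened stand-in string:

```r
library(stringr)

dino <- ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200 GCGATGGCCAAATAAGGT"

# lazy: stop at the FIRST digit-followed-by-space, i.e. right after '103 '
str_remove(dino, ">.*?[0-9]\\s")
# → "nt 1-1200 GCGATGGCCAAATAAGGT"
```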

Because of the concepts of greediness and laziness, stringr has seemingly redundant functions such as str_extract() and str_extract_all(), or str_view() and str_view_all(), among others.

In this case, we want the greedy matching to remove the entire header. Now that we're left with the header-less DNA sequence, let's save it to a new variable called dna.


3.3.0 Extracting to save for later with str_extract()

We may also want to retain our header in a separate string. str_extract() will retain the string that matches our pattern instead of removing it. We can save this in an object called header. Note that we have removed the final space (\\s) from our expression.
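A hedged sketch with the stand-in string:

```r
library(stringr)

dino <- ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200 GCGATGGCCAAATAAGGT"

# greedy: everything from '>' up to the last digit; no trailing \\s this time
header <- str_extract(dino, ">.*[0-9]")
header
# → ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200"
```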


3.4.0 Searching with str_extract_all()

Now we can look for patterns in our (dino) DNA!

Does this DNA have balanced GC content? We can use str_extract_all to capture every character that is either a G or a C.

The output is a list object storing an entry for each G or C extracted. We count the number of occurrences of G and C using str_count and divide by the total number of characters in our string to get the %GC content.
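A minimal sketch, using a short hypothetical dna stand-in:

```r
library(stringr)

dna <- "GCGATGGCCAAATAAGGT"  # stand-in sequence

gc <- str_extract_all(dna, "[GC]")        # list: one entry per G or C matched
str_count(dna, "[GC]") / str_length(dna)  # fraction of G/C characters
# → 0.5 for this stand-in
```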


3.5.0 Replacement using str_replace_all()

Let's translate this into mRNA!

To replace multiple patterns at once, a named character vector of patterns and their replacements is supplied to str_replace_all(). This allows us to perform several replacements in a single call.
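A sketch of the named-vector form (treating dna as the coding strand, so transcription is just T → U; only one pattern is needed here, but the same syntax accepts several pairs):

```r
library(stringr)

dna <- "GCGATGGCCAAATAAGGT"  # stand-in sequence

# named vector: each element is a pattern = replacement pair
mrna <- str_replace_all(dna, c("T" = "U"))
mrna
# → "GCGAUGGCCAAAUAAGGU"
```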


3.6.0 Useful search tools powered by RegEx

How do we query our sequence for the presence of specific patterns or motifs?

3.6.1 Simple questions with str_detect()

Is there even a start codon in this sequence? str_detect can be used to get a logical (TRUE or FALSE) answer to whether or not a match is found.
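For example, with the stand-in sequence:

```r
library(stringr)

dna <- "GCGATGGCCAAATAAGGT"  # stand-in sequence

str_detect(dna, "ATG")
# → TRUE: at least one start codon is present
```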


3.6.2 Counting pattern matches with str_count()

It might be more useful to know exactly how many possible start codons we have. str_count will count the number of matches in the string for our pattern.


3.6.3 Locating pattern matches with str_locate()

To get the position of a possible start codon we can use str_locate, which will return the indices (coordinates) of the start and end of the FIRST matching substring.

str_locate_all can be used to find all possible locations.
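A sketch of both calls on the stand-in sequence:

```r
library(stringr)

dna <- "GCGATGGCCAAATAAGGT"  # stand-in sequence

str_locate(dna, "ATG")     # first match only: a matrix with start = 4, end = 6
str_locate_all(dna, "AA")  # a list of start/end positions for EVERY match
```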


3.7.0 Splitting up our string with str_sub()

Let's split this string into substrings of codons, starting at the position of our start codon. We have the position of our start codon from str_locate. We can use str_sub to subset the string by position (we will just go to the end of the string for now).

str_sub(string, start, end), where string is the input string, and start and end give the positions of the first and last characters of the substring.
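A sketch of subsetting from the start codon to the end of the stand-in string:

```r
library(stringr)

dna <- "GCGATGGCCAAATAAGGT"  # stand-in sequence

start <- str_locate(dna, "ATG")[1]       # start position of the first ATG (4)
orf <- str_sub(dna, start, str_length(dna))
orf
# → "ATGGCCAAATAAGGT"
```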


3.8.0 Applying our RegEx skill further

3.8.1 Generate codons from the first open reading frame with a ...

We can get codons by extracting groups of (any) 3 nucleotides/characters in our reading frame.


3.8.1.1 Remember some functions return a list as output.

The codons are extracted into a list, but we can get our character substrings using unlist().
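A combined sketch of the extraction and the unlist() step, on a stand-in reading frame:

```r
library(stringr)

orf <- "ATGGCCAAATAAGGT"  # stand-in reading frame from the previous step

# non-overlapping groups of any 3 characters, then flatten the list
codons <- unlist(str_extract_all(orf, ".{3}"))
codons
# → "ATG" "GCC" "AAA" "TAA" "GGT"
```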


3.8.2 Search for multiple specific patterns using |

We now have a vector with 370 codons.

Do we have a stop codon in our reading frame? Let's check with str_detect. We can use round brackets ( ) to separately group the different stop codons.
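A sketch on the toy codon vector from above:

```r
library(stringr)

codons <- c("ATG", "GCC", "AAA", "TAA", "GGT")  # stand-in codons

# round brackets group each alternative stop codon
str_detect(codons, "(TAA)|(TAG)|(TGA)")
# → FALSE FALSE FALSE TRUE FALSE
```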


3.8.3 Identify specific occurrences in a vector with which()

Looks like we have many matches. We can subset the codons using str_detect (instances where the presence of a stop codon is equal to TRUE) to see which stop codons are represented.

Recall from Lecture 01 that we can use the which function to find the indices at which the stop codons are positioned. The call takes the form which(data_vector == desired_value), returning the indices where the comparison evaluates to TRUE.

Let's subset codons to end at the first stop codon.
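A sketch of the subsetting steps on the toy codon vector:

```r
library(stringr)

codons <- c("ATG", "GCC", "AAA", "TAA", "GGT")  # stand-in codons
is_stop <- str_detect(codons, "(TAA)|(TAG)|(TGA)")

codons[is_stop]                  # which stop codons are represented: "TAA"
first_stop <- which(is_stop)[1]  # index of the first stop codon (4)
codons <- codons[1:first_stop]   # keep everything up to and including it
codons
# → "ATG" "GCC" "AAA" "TAA"
```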


3.8.4 Replace your codons with amino acids

After finding our unique codons, we can translate codons into their respective proteins by using str_replace_all using multiple patterns and replacements as before.
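A sketch covering only the codons in this toy example (a real translation table would list all 64 codons):

```r
library(stringr)

codons <- c("ATG", "GCC", "AAA", "TAA")  # stand-in codons

# named vector of codon = amino-acid pairs (partial, hypothetical table)
protein <- str_replace_all(codons, c("ATG" = "M", "GCC" = "A",
                                     "AAA" = "K", "TAA" = "*"))
protein
# → "M" "A" "K" "*"
```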


3.8.5 Collapsing your results into a single string with str_flatten()

What is our final protein string? str_flatten allows us to collapse our individual protein strings into one long string.
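For example, on the toy protein vector:

```r
library(stringr)

protein <- c("M", "A", "K", "*")  # stand-in amino acids

str_flatten(protein)  # default collapse = "" joins with no separator
# → "MAK*"
```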


3.8.6 Recombining strings with str_c()

We can add our header back using str_c, which allows us to combine strings. We can use a space to separate our original strings.
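A sketch with stand-in values:

```r
library(stringr)

header  <- ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200"  # stand-in header
protein <- "MAK*"                                           # stand-in protein

str_c(header, protein, sep = " ")
# → ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200 MAK*"
```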


4.0.0 Multi FASTA file formatting

In the next exercise, you will take a multi-fasta file and convert it into a csv file that can be viewed in spreadsheet software such as MS Excel. Here are the instructions:

4.0.1 If you wanted to pull down your own version of the data set

Luckily for you, the data is already in your folders but if you wanted to get the data yourself, it would go something like this.


4.1.0 Reorganizing a multi-entry fasta file

Goal: Reorganize the data from fasta format into a spreadsheet format with one row per entry (gene) and the following columns:

For example, given the entry

>NP_001009020.1 forkhead box protein P2 [Pan troglodytes] MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQT SGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYK

The expected 5-part output would be

"NP_001009020.1" "forkhead box protein P2" "Pan" "troglodytes" "MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYK"

Once the file is in the desired format, write the file to disk in csv format.

In MS Excel, it should be organized something like

Accession_number Protein_info Genus Species Sequence
NP_001009020.1 forkhead box protein P2 Pan troglodytes MMQESATETISNSSMNQNGMST...

Let's do it!

4.1.1 Read in your file as a single string using readr::read_file()

First, we need to read in the file that we just downloaded.


4.1.2 Explore your data to understand its structure

Before we start manipulating our data, let's inspect fasta. You'll notice from above that there are \r\n characters interspersed throughout the string which are translated as line breaks.

4.1.2.1 Use writeLines() or cat() to see your output in a human-readable format

Right now our data is in a character vector, but it's just a single long string. We can't even see it properly with str() or glimpse(). We can try the print() function, but it won't render the line breaks. The writeLines() function, however, will display the data in a more understandable format. The cat() function will also interpret \n characters, and it can additionally concatenate multiple elements of a vector, joined by a sep= parameter.
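A quick sketch of the difference, on a tiny stand-in string:

```r
x <- "line one\r\nline two"  # stand-in with an embedded line break

print(x)       # shows the \r\n escape characters literally
writeLines(x)  # renders the line break
cat(x, "\n")   # also renders it; sep= controls how elements are joined
```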


4.1.3 How many entries do we have?

Given that we know that > represents the beginning of an entry in fasta format, we can count the number of > characters as a proxy for the total number of entries.
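A sketch on a tiny two-entry stand-in (the real fasta string would give the class counts):

```r
library(stringr)

# hypothetical two-entry stand-in for the FoxP2_primate.fasta string
fasta <- ">NP_001.1 protein A [Homo sapiens]\r\nMMQESAT\r\n\r\n>NP_002.1 protein B [Pan troglodytes]\r\nMMQESVT"

str_count(fasta, ">")
# → 2 for this toy string
```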

It looks like we have 130 entries. It's actually 129, but we'll get to that problem eventually. Time to work on our tasks!


4.2.0 Order of operations

  1. Remove the square brackets we see around each species, i.e. [Homo sapiens]
  2. Split each entry into separate rows
  3. Remove the > symbol from each entry
  4. Split each entry into a "header" and "sequence" portion
  5. Further subdivide the header into the components we want
  6. Clean up the sequences so they are each single strings uninterrupted by line breaks

4.2.1 Strip out and replace all occurrences of [ and ]

The best way to go through the entire string is to use str_replace_all()
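A sketch on the toy stand-in string ([ and ] are meta-characters, so they must be escaped in the pattern):

```r
library(stringr)

fasta <- ">NP_001.1 protein A [Homo sapiens]\r\nMMQESAT"  # stand-in

fasta <- str_replace_all(fasta, c("\\[" = "", "\\]" = ""))
fasta
# → ">NP_001.1 protein A Homo sapiens\r\nMMQESAT"
```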


4.2.2 Split each entry into an individual row

Each entry is separated by a few characters. Taking a closer look at the end of each entry we see ...PELEDDREIEEEPLSEDLE\r\n\r\n>NP_683697.2.

We can use str_split() but what pattern should we give it?

str_split() returns a list but we want to work with a vector! Let's quickly fix that.
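A sketch of both steps, splitting on the blank line (two consecutive \r\n) and flattening to a vector:

```r
library(stringr)

# two-entry stand-in string
fasta <- ">NP_001.1 protein A\r\nMMQESAT\r\n\r\n>NP_002.1 protein B\r\nMMQESVT"

entries <- unlist(str_split(fasta, "\r\n\r\n"))
entries
# → ">NP_001.1 protein A\r\nMMQESAT"  ">NP_002.1 protein B\r\nMMQESVT"
```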


4.2.3 Remove the > from the start of each entry

Remember each entry is now separately stored in a character vector. We want to go through each and remove the > symbol. To accomplish this we will use str_replace_all which expects a character vector as input. Perfect!


4.2.4 Split your entries into a header and sequence with str_split_fixed()

Now we want to break the header away from the actual amino acid sequence in each entry. To accomplish this we'll use str_split_fixed(string, pattern, n), where string is the input, pattern is the delimiter to split on, and n is the number of pieces to return.
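A sketch on toy entries (n = 2 splits each entry only at the first \r\n, leaving the rest of the sequence intact):

```r
library(stringr)

entries <- c("NP_001.1 protein A\r\nMMQESAT",
             "NP_002.1 protein B\r\nMMQESVT")  # stand-in entries

header_seq <- str_split_fixed(entries, "\r\n", 2)  # matrix with 2 columns
header_seq[, 1]  # headers
header_seq[, 2]  # sequences (may still contain internal line breaks)
```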


4.2.4.1 Remove duplicated() entries with the help of which()

Okay, a quick side trip to remind/show you some of the tricks you can use in your data wrangling. Are there duplicated entries in our original fasta file? Perhaps there are unique entry headers but sequences may be duplicated. Let's cull them from the set for now.

How do we check which entries are duplicates of sequences already present in fasta5?


4.2.4.2 Remove duplicated() entries with the help of filter()

Rather than use the more complicated code above, we can rely on dplyr to help us out with a call to filter instead. In this case we'll want to remember that we want to filter for the non-duplicated entries. We can accomplish this with the logical not ! annotation.

Put it into something we'll call header_seq_subset
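A sketch of both approaches on a hypothetical three-row frame (column names here are assumptions):

```r
library(dplyr)

# hypothetical frame mirroring the two-column header/sequence structure
header_seq <- data.frame(header   = c("A", "B", "C"),
                         sequence = c("MMQ", "MMV", "MMQ"))

which(duplicated(header_seq$sequence))  # → 3: row 3 repeats row 1's sequence

header_seq_subset <- header_seq %>% filter(!duplicated(sequence))
```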


4.2.5 Break up the header column into the components using str_match_all()

Let's review. We now have a filtered data frame consisting of 2 columns: the first contains the headers from each fasta entry, and the second contains the sequences. Let's break our header into the components we wanted: the accession number, the protein info, the genus, and the species.

We'll accomplish this with str_match_all() which takes in our string and returns a list of our matched groups in a vectorised format. The key here is in the pattern where each group is defined by the matching pattern within each set of (parentheses). Let's keep in mind that the return output of str_match_all() will be a matrix where the first column has the complete match, and each additional column is reserved for matching groups as defined above.

In this case we are also going to take advantage of greedy matching to capture protein_info which may take any non-uniform format!

XP_007980836.1 PREDICTED: forkhead box protein P2 isoform X2 Chlorocebus sabaeus

Let's break this header down into some subcomponents which are all separated by whitespace

Text Entry Properties
XP_007980836.1 alphanumeric series, separated by ".", ending with digits
PREDICTED: forkhead box protein P2 isoform X2 alphanumeric with any number of words and spaces present
Chlorocebus single alphabetical word
sabaeus single alphabetical word
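One hedged sketch of such a pattern (the exact classroom expression may differ; greedy (.+) leaves the last two words for genus and species):

```r
library(stringr)

header <- "XP_007980836.1 PREDICTED: forkhead box protein P2 isoform X2 Chlorocebus sabaeus"

# groups: accession | protein info (greedy) | genus | species
matches <- str_match_all(header, "(\\w+\\.\\d+)\\s(.+)\\s(\\w+)\\s(\\w+)")
matches[[1]]
# column 1 holds the full match; columns 2-5 hold the four captured groups
```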

4.2.6 Reformat our header information into a data frame

We now have a list of 1x5 matrices, where we only care about the information in columns 2-5 of each matrix. How do we pull that information from each entry of the list? Currently fasta_header exists as a list where each element is a matrix. Let's convince ourselves in the following code cell.

How do we approach converting this to something like a data frame? There are a number of ways to do this and we'll show you a couple.

  1. Remove the unwanted entry as we build the data frame
  2. Convert the list to a data frame and remove the unwanted columns

4.2.6.1 A note about working with data frames and rbind()

There are a number of ways to add a row to a data frame. What must always be remembered is that to add a row, two criteria must be met:

  1. The number of columns must match
  2. The names of the columns must match

Otherwise, the program may not stop but it will certainly not give you the output you were expecting.

A great way to add rows is with the rbind() command. Behind the scenes, when we call this command, the interpreter evaluates the inputs for the call and decides on which implementation of rbind() to use i.e. is this for a data frame? or a matrix? etc. In our case, we will explicitly call on rbind.data.frame() to ensure we get a data.frame as a result.

4.2.6.2 Use a loop to build your data frame one row at a time

We want to repeat the above command for every entry within fasta_header. We haven't discussed loops yet, but will delve into these control structures in a future lecture. Here, however, we want to quickly use it to accomplish a repetitive action. Remember our rules for adding rows to a data frame!
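A sketch of the loop, with a hypothetical two-element fasta_header built inline so the snippet is self-contained:

```r
# hypothetical stand-in: a list of 1x5 matrices as returned by str_match_all()
fasta_header <- list(
  matrix(c("full1", "NP_001.1", "protein A", "Homo", "sapiens"), nrow = 1),
  matrix(c("full2", "NP_002.1", "protein B", "Pan", "troglodytes"), nrow = 1)
)

header.df <- data.frame()
for (i in seq_along(fasta_header)) {
  row <- as.data.frame(fasta_header[[i]][, 2:5, drop = FALSE])
  colnames(row) <- c("Accession_number", "Protein_info", "Genus", "Species")
  header.df <- rbind(header.df, row)  # columns and names must match each time
}
```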

4.2.6.3 do.call() to save yourself some trouble

The do.call() function "constructs and executes a function call from a name or a function and a list of arguments to be passed to it." It takes the form of:

do.call(what, args, quote=FALSE, envir = parent.frame())

where what is the function (or the name of a function) to call, and args is a list of arguments to pass to it.

How does this help us? Although not exactly the same, it works very similarly to lapply(), where we provide a list and a function is applied over it. The distinction is under the hood: lapply() applies the same function separately to each element of a list, whereas do.call() supplies the whole list of arguments to a single function call. Furthermore, lapply() always returns a list, whereas do.call() returns whatever object the called function produces.

Overall, that means we can do this...

Don't forget to rename your columns!
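A sketch of the whole-list approach, again with a hypothetical inline fasta_header:

```r
# hypothetical list of 1x5 matrices from str_match_all()
fasta_header <- list(
  matrix(c("full1", "NP_001.1", "protein A", "Homo", "sapiens"), nrow = 1),
  matrix(c("full2", "NP_002.1", "protein B", "Pan", "troglodytes"), nrow = 1)
)

# coerce each matrix to a data frame, then bind all rows in one call
header_list <- lapply(fasta_header, as.data.frame)
header.df <- do.call(rbind.data.frame, header_list)

header.df <- header.df[, 2:5]  # drop the full-match column
colnames(header.df) <- c("Accession_number", "Protein_info", "Genus", "Species")
```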

4.2.7 Reformat our sequence entries with str_*() methods

Our sequence entries in header_seq_subset are full of line breaks in the form of \r\n and we don't want these anymore. So how can we go about removing those easily?

There are a number of approaches we could take but let's use the tools we've already encountered:

  1. str_split()
  2. str_flatten()
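The two steps above can be sketched like this on stand-in sequences:

```r
library(stringr)

seqs <- c("MMQESAT\r\nETISNSS", "MMQESVT\r\nETISNAS")  # stand-in sequences

# split each sequence on the line breaks, then flatten the pieces back together
clean <- sapply(str_split(seqs, "\r\n"), str_flatten)
clean
# → "MMQESATETISNSS" "MMQESVTETISNAS"
```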

4.2.8 Adding columns to your data frame - a tool for each purpose

It's the last step. We want to add our final sequence information to header.df, and there are a number of ways to perform this based on what we've learned. Remember, we also want the new column to be named "Sequence".

  1. A for loop... but that's more coding than we need
  2. cbind() can add to our data frame
  3. Using $ to create a new variable/column directly in header.df
  4. Use a dplyr verb to create a new column

Let's try some of them!
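Two of the options can be sketched as follows (header.df and the cleaned sequences are hypothetical stand-ins here):

```r
library(dplyr)

header.df <- data.frame(Accession_number = c("NP_001.1", "NP_002.1"))  # stand-in
clean_seqs <- c("MMQESATETISNSS", "MMQESVTETISNAS")                    # stand-in

header.df$Sequence <- clean_seqs                    # base R: assign directly
header.df <- mutate(header.df, Sequence = clean_seqs)  # or the dplyr verb
```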

5.0.0 Resources

http://stat545.com/block022_regular-expression.html
http://stat545.com/block027_regular-expressions.html
http://stat545.com/block028_character-data.html
http://r4ds.had.co.nz/strings.html
http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/
https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054
http://emailregex.com/
https://regex101.com/
https://regexr.com/
https://www.regular-expressions.info/backref.html
https://www.regular-expressions.info/unicode.html
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
https://simongoring.shinyapps.io/RegularExpressionR/

5.1.0 Post-lesson assessment

Soon after the end of this lecture, a homework will be available for you on DataCamp. You will have until 12:00 hours on Thursday, February 18th, to submit your assignment (right before the next lecture). This is a pass-fail assignment based on completion of the material provided.

Also, see section 6.0.0 for an additional challenge.


Thanks for coming!!!

yougotthis.jpg


6.0.0 Challenge: a real messy dataset

I looked for a messy dataset for data cleaning and found it in a blog titled:
"Biologists: this is why bioinformaticians hate you..."

Challenge:

This is the Wellcome Trust APC dataset, which reports the costs of open access publishing by providing article processing charge (APC) data.

https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054

The main issue with this dataset is that data entry was done without a structured vocabulary: people could type whatever they wanted into free-text answer boxes. There were no dropdown menus with limited options, no errors raised for incorrect formatting, and no stipulated rules (i.e. must be all lowercase, uppercase, no numbers, spacing, etc.).

What I want to know is:

  1. List 3 problems with this dataset that require data cleaning.
  2. What is the mean cost of publishing for the top 3 most popular publishers?
  3. What is the number of publications by PLOS One in the dataset?
  4. Convert sterling to CAD. What is the median cost of publishing with Elsevier in CAD?

The route I suggest taking to answer these questions is:

There is a README file to go with this spreadsheet if you have questions about the data fields.


The blogger's opinion of cleaning this dataset:

'I now have no hair left; I’ve torn it all out. My teeth are just stumps from excessive gnashing. My faith in humanity has been destroyed!'

Don't get to this point. The dataset doesn't need to be perfect. No datasets are 100% clean. Just do what you gotta do to answer these questions.

We can talk about how this went at the beginning of next week's lecture.


And we are done for the day! Well done!

Just for fun, here are some other uses for writeLines() and cat() functions